Abstract. We describe a set of bilingual English-French and English-German
parallel corpora in which the direction of translation is accurately and reliably
annotated. The corpora are diverse, consisting of parliamentary proceedings,
literary works, transcriptions of TED talks and political commentary. They will
be instrumental for research on translationese and its applications to (human and
machine) translation; specifically, they can be used for the task of translationese
identification, a research direction that has enjoyed growing interest in recent years.
To validate the quality and reliability of the corpora, we replicated previous
results of supervised and unsupervised identification of translationese, and further
extended the experiments to additional datasets and languages.


1 Introduction 

Research in all areas of language and linguistics is stimulated by the unprecedented 
availability of data. In particular, large text corpora are essential for research of the 
unique properties of translationese: the sub-language of translated texts (in any given 
language) that is presumably distinctly different from the language of texts originally 
written in the same language. Indeed, contemporary research in translation studies is
dominated by corpus-based approaches [1-9]. Most studies of translationese
utilize monolingual comparable corpora, i.e. corpora where translations from multiple 
languages into a single language are compared with texts written originally in the target 
language. 

The unique characteristics of translated language have been traditionally classified 
into two categories: properties that stem from the interference of the source language 
[10], and universal traits elicited from the translation process itself, regardless of the 
specific source and target language pair [1,11]. Computational investigation of
translated texts has been a prolific field of recent research, laying out an empirical foundation
for the theoretically motivated hypotheses on the characteristics of translationese. More
specifically, identification of translated texts by means of automatic classification has shed
much light on the manifestation of translation universals and interference phenomena
in translation [12-20].

Along the way it was suggested that the unique properties of translationese should
be studied in a parallel setting, i.e., in the context of the corresponding source language: the
original language the text was produced in. In particular, several studies hypothesized
that certain phenomena traditionally attributed to translation universals (i.e.,
source-language independent) are, in fact, derivatives of the linguistic characteristics of the
specific language pair subject to translation. [21] investigated the omission of
the optional “that” after English reporting verbs, as in “he claimed [that] they left
the room”, highlighting the correlation of this behavior with the linguistic conventions of the
source language. [22] raised similar arguments regarding explicitation in translation
(e.g., excessive usage of cohesive devices); he claimed that this phenomenon should
only be studied by comparative analysis of a translation and its original counterpart.

A parallel setting can also facilitate the task of automatic identification of
translationese. [14] were the first to employ bilingual text properties for identification of the
direction of translated parallel texts. They took advantage of sentence pairs translated
in both directions for training a supervised classifier to identify translationese using
word and part-of-speech (POS) n-grams as features. [23] used POS MTU (minimal
translation unit) n-grams and HMM distortion properties extracted from bilingual
parallel English-French Europarl and Hansard texts. Subsequently, they carried out a series
of experiments on sentence-level identification of translationese using Brown cluster
[24] MTUs on the Hansard corpus [25].

A good corpus for research into the properties of translationese should ideally
satisfy the following desiderata:

Diversity. The corpus should reflect diverse genres, registers, authors, modalities
(written vs. spoken), etc.

Parallelism. The corpus should include both the source and its translation, so that
features that are revealed in the translation can be traced back to their origins in the source.

Multilinguality. Having translations from several source languages to the same target
language facilitates a closer inspection of properties that are language-pair-specific vs.
more “universal” features of translationese [1,26,27,22].

Uniformity. Whatever processing is done on the texts, it must be done uniformly. This
includes sentence boundary detection, tokenization, sentence- and word-alignment, POS
tagging, etc.

Availability. Finally, corpora that are used for research must be publicly available so
that other researchers have the opportunity to replicate and corroborate research results.

In this work we describe a set of cross-domain, parallel, uniform English-French
and English-German corpora that were compiled specifically for research on
translationese.¹ The corpora are diverse, consisting of parliamentary proceedings, literary
works, transcriptions of TED talks and political commentary. We rigorously evaluate
all datasets by a series of supervised and unsupervised experiments; sensitivity analysis
further implies applicability of these methodologies to data-meager scenarios. These
datasets will be instrumental for research on translationese and its manifestations; they
will also facilitate accurate identification of translationese in small text units by
exploiting bilingual text properties.

¹ All corpora are available at http://cl.haifa.ac.il/projects/translationese/index.shtml



We detail the structure of the corpus in Section 2, explain how it was processed in
Section 3, and evaluate it by extending some state-of-the-art supervised and
unsupervised experiments in Section 4. We conclude with suggestions for future extensions.


2 Corpus structure 

Our corpus of translationese consists of five sub-corpora: Europarl, Canadian Hansard,
literature, TED and political commentary.² All are parallel corpora, with accurate
annotation indicating the direction of the translation. The datasets are uniformly
preprocessed, represented, and organized. All corpora were further filtered to contain solely
one-to-one sentence alignments, which are more useful for SMT research. Tables 1
and 2 report some statistical data on the corpus (after tokenization).


Table 1. English-French corpus statistics

          # of sentences                      # of tokens                 # of types
corpus    Original EN  Original FR  total     Original EN  Original FR    Original EN  Original FR
EUR       217K         130K         347K      9,572K       10,542K        61K          73K
HAN       5,237K       1,379K       6,616K    132,232K     147,463K       193K         196K
LIT       35K          98K          133K      2,875K       2,898K         52K          66K
TED       7K           4K           12K       217K         239K           14K          17K


Table 2. English-German corpus statistics

          # of sentences                      # of tokens                 # of types
corpus    Original EN  Original DE  total     Original EN  Original DE    Original EN  Original DE
EUR       225K         155K         380K      10,550K      10,067K        63K          170K
LIT       45K          48K          93K       2,854K       2,666K         56K          104K
POL       8K           9K           18K       443K         421K           26K          44K


2.1 Europarl 

The Europarl sub-corpus is extracted from the collection of the proceedings of the
European Parliament, dating back to 1996, originally collected by [28]. The original corpus³
is organized as several language pairs, each with multiple sentence-aligned files. We
mainly used the English-French and English-German segments, but resorted to other
segments as we presently explain. We focus below on the way we generated the
English-French sub-corpus; its English-German counterpart was obtained in a similar way.

² We use “EUR”, “HAN”, “LIT”, “TED” and “POL” to denote the five corpora hereafter.

³ The original Europarl is available from http://www.statmt.org/europarl/



Europarl is probably the most popular parallel corpus in natural language
processing, and it was indeed used for many of the translationese tasks surveyed in Section 1.
Unfortunately, it is a very problematic corpus. First, it consists of transcriptions of
spoken utterances that are edited (by the speakers) after they are transcribed; only then
are they translated. Consequently, there are significant discrepancies between the actual
speeches and their “verbatim” transcriptions [29]. Second, while “Members of the
European Parliament have the right to use any of the EU’s [24] official languages when
speaking in Parliament”,⁴ many of them prefer to speak in English, which is often not
their native language.⁵

Mainly due to its multilingual nature, however, Europarl has been used extensively
in SMT [30] and in cross-lingual research [31]. It has even been adapted specifically for
research in translation studies: [32] compiled a customized version of Europarl, where
the direction of translation is indicated. They used meta-data from the corpus, and in
particular the “language” tag, to identify the original language in which each sentence
was spoken, and removed sentence pairs for which this information was missing. A
similar strategy was used by [33] and [34]. However, relying on the “language” tag
in Europarl parallel text to identify the translation direction is potentially
unreliable. Next we detail our procedure for reliable extraction of speaker details, including
the original language of each sentence.

The Europarl corpus is a collection of several monolingual (parallel) corpora: the
original text was uttered in one language and then translated to several other languages.
In each sub-corpus, each paragraph is annotated with meta-information, in particular,
the original language in which the paragraph was uttered. Unfortunately, the
meta-information pertaining to the original language of Europarl utterances is frequently
missing. Furthermore, in some cases this information is inconsistent: different
languages are indicated as the original languages of (various translations of) the same
paragraph (in the various sub-corpora). Additionally, the Europarl corpus includes
several bilingual sub-corpora that are generated from the original and the translated texts,
and are already sentence-aligned. These bilingual corpora include only raw sentence
pairs, with no meta-information.

⁴ http://europa.eu/about-eu/facts-figures/administration/index_en.htm

⁵ http://www.theguardian.com/education/datablog/2014/may/21/european-parliament-english-language-official-debates-data

To minimize the risk of erroneous information, we processed the Europarl corpus
as follows. First, we propagated the meta-information from the monolingual texts to the
bilingual sub-corpora: each sentence pair was thus annotated with the original language
in which it was uttered. We repeated this process five times, using as the source of
meta-information the original monolingual corpora in five languages: English, French,
German, Italian, and Spanish. Not all monolingual corpora are identical (some are
much larger than others), so not all the English-French sentence pairs
in our bilingual corpus are reflected in all five monolingual corpora, and therefore some
sentence pairs have fewer than five annotations of the original language. We restricted
the bilingual corpus to only those sentence pairs that had five annotations. Then, we
filtered out all sentence pairs whose annotations were inconsistent (about 0.5%). We also
removed comments (about 0.5% as well), typically written in parentheses (e.g.,
“applause”, “continuation of the previous session”). As a result, we are confident
that the speaker information (and in particular, the original language of utterances) in
the filtered corpus is highly accurate.
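
The two filtering criteria described above (five annotations, all of them in agreement) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the corpus: the function name `filter_pairs`, the dictionary layout (a sentence-pair id mapped to the list of original-language codes propagated from the monolingual corpora) and the toy data are all hypothetical.

```python
def filter_pairs(annotations, required=5):
    """Keep sentence pairs that carry `required` annotations which all agree.

    `annotations` maps a (hypothetical) sentence-pair id to the list of
    original-language codes propagated from the monolingual corpora.
    Returns a dict from pair id to the agreed original language.
    """
    kept = {}
    for pair_id, langs in annotations.items():
        if len(langs) != required:   # not covered by every monolingual corpus
            continue
        if len(set(langs)) != 1:     # annotations disagree
            continue
        kept[pair_id] = langs[0]
    return kept


# Toy data (invented): only the first pair survives both filters.
example = {
    "pair-1": ["EN", "EN", "EN", "EN", "EN"],
    "pair-2": ["EN", "FR", "EN", "EN", "EN"],  # inconsistent annotations
    "pair-3": ["FR", "FR", "FR"],              # fewer than five annotations
}
print(filter_pairs(example))
```

Requiring unanimous agreement is deliberately conservative: it trades corpus size for confidence in the direction label.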


2.2 The Canadian Hansard 

The Hansard corpus is a parallel corpus consisting of transcriptions of the Canadian
parliament in (Canadian) English and French from 2001 to 2009. We used a version that was
annotated with the original language of each parallel sentence. This corpus most likely
suffers from problems similar to those of the Europarl corpus discussed above; indeed, [35], who
investigated the British Hansard parliamentary transcripts, found that “the transcripts
omit performance characteristics of spoken language, such as incomplete utterances or
hesitations, as well as any type of extrafactual, contextual talk” and that “transcribers
and editors also alter speakers’ lexical and grammatical choices towards more
conservative and formal variants.” Still, this is the largest available source of English-French
sentence pairs. In addition to parliament members’ speech, the original Hansard corpus
contains metadata. Various annotations were used to discriminate different line types,
including the date of the session, the name of the speaker, etc. We filtered out all
segments except those referring to speech; in total, about 15% of the corpus line-pairs were
thus eliminated.


2.3 Literary Classics 

Our English-French literary corpus consists of classics written and translated in the
18th-20th centuries by English and French writers. Most of the raw material is
available from the Gutenberg project⁶ and FarkasTranslations.⁷ The English-German
literature corpus was generated in a similar way: we used material from the Gutenberg
project, Wikisource,⁸ and a few more books. Both English-French and English-German
datasets contain a metadata file with details about the books: title, year of
publication, translator name and year of translation.

Identification of translationese in literary text by means of classification is
considered a more challenging task [36,37] than classifying more “technical” translations,
such as parliament proceedings. Translators of literature typically enjoy more freedom
and fewer constraints, rendering the translated text more similar to original writing.
Additionally, our literature corpus spans almost three centuries and comprises works from a wide
range of genres, traits that overshadow the subtle characteristics of translationese [15,
38,39,19]. Even under these circumstances, we obtain very high accuracy with supervised
classification on this corpus, and moderate yet reasonable results with unsupervised
clustering (see Section 4.3).


⁶ http://www.gutenberg.org

⁷ http://farkastranslations.com/

⁸ http://en.wikisource.org/



2.4 TED Talks 


Our TED corpus is based on the subtitles of TED talks delivered in English and
translations to English of TEDx talks originally given in French.⁹ We used the TED
API¹⁰ to extract subtitles of talks delivered in English, and the YouTube API for TEDx talks
originally given in French.

The quality of translation in this corpus is very high: not only are translators
assumed to be competent, but the common practice is that each translation passes through
a review before being published. This corpus consists of talks delivered orally, but we
assume that they were meticulously prepared, so the language is not spontaneous but
rather planned. Compared to the other sub-corpora, the TED dataset has some unique
characteristics: (i) its size is relatively small; (ii) it
exhibits stylistic disparity between the original and translated texts (the former contains
more “oral” markers of spoken language, while the latter is a written translation); and
(iii) TED talks are not transcribed but are rather subtitled, so they undergo some
editing and rephrasing. The vast majority of TED talks are publicly available online, which
makes this corpus easily extendable for future research.


2.5 Political News and Commentary 

This corpus contains articles, commentary and analysis on world affairs and
international relations. English articles and their translations to German were collected from
Project Syndicate.¹¹ This is a non-profit organization that primarily relies on
contributions from newspapers in developed countries. It provides original commentaries
by people who are shaping the world’s economics, politics, science and culture. We
collected articles categorized as World Affairs from this project, originally written by
English authors and translated to German. Original German commentaries and their
translations to English were collected from the Diplomatics Magazine,¹² specifically,
from its International Relations section.

3 Processing 

The original Europarl corpora are already sentence-aligned, using an implementation
of the Gale and Church sentence-alignment algorithm [40]. Since the alignment was
done for one source paragraph at a time (typically consisting of a few sentences), its
quality is very high. The same also holds for the Hansard corpus, so we used the original
alignments for both sub-corpora. We then filtered out any alignments that were not
one-to-one; this resulted in a loss of about 3% of the alignments in Europarl, and only 2%
in Hansard.

⁹ TEDx are TED-like events not restricted to a specific language. We could not find a sufficient
amount of German TEDx talks translated to English.

¹⁰ http://developer.ted.com/

¹¹ http://www.project-syndicate.org/

¹² http://www.diplomatisches-magazin.de/



The literary sub-corpus required more careful attention. Books that were acquired
from FarkasTranslations.com were available pre-aligned at the chapter and
paragraph level; we therefore sentence-aligned them, one paragraph at a time, using a
Python implementation [41] of the Gale and Church algorithm. For the remainder of the
books, we first extracted chapters by (manually) identifying characteristic chapter titles
(e.g., Roman numerals, explicit “Chapter N”). Paragraph boundaries within a chapter
are typically marked by a double newline in Gutenberg transcripts, and we used this
pattern to break chapters into paragraphs. Because the Gale-Church algorithm
relies only on text length for alignment, it is easily adapted to aligning other logical
units, e.g., paragraphs [40]. Finally, we aligned sentences within paragraphs using the
same algorithm.
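
The chapter and paragraph segmentation can be illustrated with a short Python sketch. It is an approximation under stated assumptions: the paper identified chapter titles manually, whereas the regular expression `CHAPTER_RE` below is a hypothetical heuristic (a line holding only a Roman numeral, or an explicit “Chapter N” heading), and the function names are invented for illustration.

```python
import re

# Hypothetical heading pattern: "Chapter 3", "CHAPTER IV", or a bare Roman
# numeral alone on a line. Real books need per-title inspection.
CHAPTER_RE = re.compile(
    r"^\s*(?:chapter\s+(?:\d+|[ivxlcdm]+)|[ivxlcdm]+)\.?\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_chapters(text):
    """Split a book into chapter bodies at lines matching CHAPTER_RE."""
    return [part.strip() for part in CHAPTER_RE.split(text) if part.strip()]


def split_paragraphs(chapter):
    """Split a chapter into paragraphs on blank lines (double newline)."""
    blocks = re.split(r"\n\s*\n", chapter)
    # Collapse line breaks inside a paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


book = (
    "CHAPTER I\n\nFirst paragraph, line one\nline two.\n\n"
    "Second paragraph.\n\nCHAPTER II\n\nThird paragraph."
)
for chapter in split_chapters(book):
    print(split_paragraphs(chapter))
```

Note that any front matter preceding the first heading would be returned as a leading "chapter" and has to be discarded separately.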

The literature sub-corpus is very different in genre: presumably because translators
take greater liberty, restricting the dataset to one-to-one
sentence alignments resulted in a loss of more than 10% of each book.

Sentence-alignment of subtitles of TED talks originally delivered in French (and
translated to English) involved synchronization of subtitle frames. A typical frame in a
subtitle (.srt) file contains the frame start and end time (including milliseconds), as well as
the frame text:


    18                                                 frame sequential number
    00:00:47,497 --> 00:00:50,813                      frame start and end time
    Cet engagement, je pense que j'ai fait le choix    frame text
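
A frame of this shape can be read with a few lines of Python. The sketch below is illustrative only; the helper names `to_ms` and `parse_frame` and the dictionary layout are assumptions, and it handles a single well-formed frame rather than a full .srt file.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d+):(\d+),(\d+)")


def to_ms(stamp):
    """Convert an 'HH:MM:SS,mmm' time stamp to milliseconds."""
    hours, minutes, seconds, millis = (
        int(group) for group in TIME_RE.fullmatch(stamp.strip()).groups()
    )
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_frame(block):
    """Parse one subtitle frame: index line, timing line, then text lines."""
    lines = block.strip().splitlines()
    start, end = lines[1].split("-->")
    return {
        "index": int(lines[0]),
        "start": to_ms(start),
        "end": to_ms(end),
        "text": " ".join(line.strip() for line in lines[2:]),
    }


frame = parse_frame(
    "18\n00:00:47,497 --> 00:00:50,813\n"
    "Cet engagement, je pense que j'ai fait le choix"
)
print(frame["start"], frame["end"])
```

Keeping times as integer milliseconds makes the later comparison against a millisecond threshold a plain subtraction.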


First, we re-organized the subtitle file to contain (longer) frames that start and end
on a sentence boundary; we achieved this by concatenating frames until a
sentence-terminating punctuation symbol is reached. This procedure was conducted on both the French
subtitles and their corresponding English translations. Then, we aligned the
English-French parallel files at the paragraph level by alternately concatenating paragraphs until
the frame end times on the English and French sides are synchronized (up to a threshold δ
that was fixed to 500 milliseconds). The pseudocode of the paragraph-alignment
procedure is detailed in Algorithm 1.
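
Both steps can be sketched in Python as follows. This is an iterative, zero-indexed rendering of the recursive procedure, offered as a sketch rather than the authors' implementation: the function names, the `{"text", "end"}` dictionary layout, and the choice of ".", "!" and "?" as sentence terminators are assumptions.

```python
def merge_to_sentences(frames, terminators=".!?"):
    """Concatenate consecutive frames until one ends a sentence."""
    merged, texts = [], []
    for frame in frames:
        texts.append(frame["text"])
        if frame["text"].rstrip().endswith(tuple(terminators)):
            merged.append({"text": " ".join(texts), "end": frame["end"]})
            texts = []
    if texts:  # trailing frames without sentence-final punctuation
        merged.append({"text": " ".join(texts), "end": frames[-1]["end"]})
    return merged


def align_paragraphs(left, right, delta_ms=500):
    """Pair paragraphs whose end times differ by at most delta_ms.

    While the end times are further apart, the side that ends earlier is
    extended with its next paragraph. Returns the aligned text pairs and
    the unaligned remainders of both sides.
    """
    pairs, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        l_text, l_end = left[i]["text"], left[i]["end"]
        r_text, r_end = right[j]["text"], right[j]["end"]
        while (abs(l_end - r_end) > delta_ms
               and i < len(left) - 1 and j < len(right) - 1):
            if l_end > r_end:   # right side ends earlier: advance it
                j += 1
                r_text += " " + right[j]["text"]
                r_end = right[j]["end"]
            else:               # left side ends earlier: advance it
                i += 1
                l_text += " " + left[i]["text"]
                l_end = left[i]["end"]
        pairs.append((l_text, r_text))
        i, j = i + 1, j + 1
    return pairs, (left[i:], right[j:])


french = [{"text": "A.", "end": 1000}, {"text": "B.", "end": 2000},
          {"text": "C.", "end": 3000}]
english = [{"text": "ab.", "end": 2100}, {"text": "c.", "end": 3050}]
print(align_paragraphs(french, english))
```

An iterative loop is used here instead of recursion simply to avoid Python's recursion limit on long talks; the pairing logic is the same.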

We further aligned the paragraph-aligned TED and TEDx corpora at the sentence
level using the same sentence-alignment procedure [40]. TED talks tend to vary greatly
in terms of sentence alignments (one-to-one, one-to-many, many-to-one, many-to-many).
On average, approximately 10% of the alignments are not one-to-one; those were
filtered out as well.


Algorithm 1 TED subtitles paragraph-alignment algorithm

Comment: l_paragraphs and r_paragraphs are (not necessarily equal-length) arrays of text
paragraphs for alignment, from the left and right sides, respectively

δ = 500 milliseconds  ▷ threshold controlling the allowed delta in aligned frames’ end time
subtitles_paragraph_alignment(1, 1)  ▷ initial invocation, assuming the arrays start from 1

procedure subtitles_paragraph_alignment(l_count, r_count)
    if (l_count > l_paragraphs.length) and (r_count > r_paragraphs.length) then
        return
    end if
    if (l_count > l_paragraphs.length) then
        output r_paragraphs[r_count : r_paragraphs.length]  ▷ remainder of the right side
        return
    end if
    if (r_count > r_paragraphs.length) then
        output l_paragraphs[l_count : l_paragraphs.length]  ▷ remainder of the left side
        return
    end if

    l_current = l_paragraphs[l_count].frame_content
    l_frame_end = l_paragraphs[l_count].frame_end
    r_current = r_paragraphs[r_count].frame_content
    r_frame_end = r_paragraphs[r_count].frame_end

    while (|l_frame_end - r_frame_end| > δ and (l_count < l_paragraphs.length)
           and (r_count < r_paragraphs.length)) do
        if (l_frame_end > r_frame_end) then  ▷ advance on the right side
            r_count += 1; r_current += r_paragraphs[r_count].frame_content
            r_frame_end = r_paragraphs[r_count].frame_end
        else  ▷ advance on the left side
            l_count += 1; l_current += l_paragraphs[l_count].frame_content
            l_frame_end = l_paragraphs[l_count].frame_end
        end if
    end while

    output “aligned paragraph pair:”, l_current, r_current
    subtitles_paragraph_alignment(l_count + 1, r_count + 1)  ▷ recursive invocation
end procedure


4 Evaluation

To validate the quality of the corpus we replicated the experiments of [18], who
conducted a thorough exploration of supervised classification of translationese, using dozens
of feature types. While [18] only used the Europarl corpus (in its original format)
and worked on English translated from French, we extended the experiments to all
the datasets described above, including also English translated from German, as well
as French and German translations from English. We show that in-domain
classification (with ten-fold cross-validation evaluation) yields excellent results. Moreover, very
good results are obtained using unsupervised classification, implying robustness of this
methodology and its applicability to various domains and languages.

4.1 Preprocessing and tools 

The (tokenized) datasets were split into chunks of approximately 2000 tokens,
respecting sentence boundaries and preserving punctuation. We assume that translationese
features are present in the texts or speeches across author, genre or native language; thus
we allow some chunks to contain linguistic information from two or more different
speakers simultaneously. The frequency-based features are normalized by the number
of tokens in each chunk. For POS tagging, we employ the Stanford implementation
along with its models for English, French and German [42].
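
The chunking step can be sketched as below. The sketch assumes that a chunk is closed at the first sentence boundary at which it holds at least the target number of tokens, and that a shorter trailing remainder is dropped; the text does not specify either detail, and the function name `make_chunks` is hypothetical.

```python
def make_chunks(sentences, target=2000):
    """Group tokenized sentences into chunks of roughly `target` tokens.

    `sentences` is a list of token lists. Sentences are never split, so a
    chunk may slightly exceed `target`. The final, shorter remainder is
    dropped (an assumption, not stated in the text).
    """
    chunks, current = [], []
    for sentence in sentences:
        current.extend(sentence)
        if len(current) >= target:
            chunks.append(current)
            current = []
    return chunks


# Toy example with a target of 4 tokens instead of 2000.
toy = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"], ["i"], ["j"]]
print(make_chunks(toy, target=4))
```

Because consecutive sentences are concatenated regardless of speaker, a chunk may mix material from several speakers, as the paragraph above allows.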

We use Platt’s sequential minimal optimization algorithm [43] to train a support
vector machine classifier with the default linear kernel, an implementation freely
available in Weka [44]. In all classification experiments we use (the maximal) equal number
of chunks from each class: original (O) and translated (T).


4.2 Features 

The first feature set we utilized for classification tasks comprises function words (FW),
probably the most popular choice ever since [45] used it successfully for the
Federalist Papers. Function words proved to be suitable features for multiple reasons: (i) they
abstract away from contents and are therefore less biased by topic; (ii) their frequency
is so high that by and large they are assumed to be selected unconsciously by authors;
(iii) although not easily interpretable, they are assumed to reflect grammar, and
therefore facilitate the study of how structures are carried over from one language to another.
We used the list of more than 400 English function words provided in [15], and a similar
number of French and German function words.¹³
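
As an illustration, function-word features for one chunk can be computed as below, with each count normalized by the chunk length as described in Section 4.1. The function name and the three-word list are placeholders; the actual lists contain several hundred words.

```python
from collections import Counter


def fw_features(chunk, function_words):
    """Return one normalized frequency per function word for a chunk.

    `chunk` is a list of tokens; `function_words` is an ordered list that
    fixes the feature order. Matching is case-insensitive (an assumption).
    """
    counts = Counter(token.lower() for token in chunk)
    size = len(chunk)
    return [counts[word] / size for word in function_words]


# Placeholder list for illustration only.
words = ["the", "on", "of"]
chunk = ["The", "cat", "sat", "on", "the", "mat", "."]
print(fw_features(chunk, words))
```

Each chunk thus becomes a fixed-length vector whose dimensionality equals the size of the function-word list.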

A more informative way to represent (admittedly shallow) syntax is to use part-of-speech
(POS) trigrams. Triplets such as PP (personal pronoun) + VHZ (verb “have”,
3rd person sing. present) + VBN (verb “be”, past participle) reflect a complex tense
form, represented distinctively across languages. In Europarl, for example, this triplet
is highly frequent in translations from Finnish and Danish and much rarer in translations
from Portuguese and Greek.
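
Counting POS trigrams is straightforward once the text is tagged. The sketch below assumes that trigrams do not cross sentence boundaries and takes the tag sequences as given (tagging itself is outside its scope); the function name is hypothetical.

```python
from collections import Counter


def pos_trigram_counts(tagged_sentences):
    """Count POS-tag trigrams within each sentence.

    `tagged_sentences` is a list of tag lists, one per sentence.
    Sentences shorter than three tags contribute nothing.
    """
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(zip(tags, tags[1:], tags[2:]))
    return counts


tags = [["PP", "VHZ", "VBN", "VVN"], ["PP", "VHZ", "VBN"], ["NN", "SENT"]]
counts = pos_trigram_counts(tags)
print(counts.most_common(2))
```

`Counter.most_common(k)` also gives a direct way to restrict the feature set to the k most frequent trigrams in a dataset.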

We also used positional token frequency [46]. The feature is defined as counts of
words occupying the first, second, third, penultimate and last positions in a sentence.
The motivation behind this feature is that sentences open and close differently across
languages, and it should be expected that these opening and closing devices will be
transferred from the source if they do not violate the grammaticality of the target
language. Positional tokens were previously used for translationese identification [18] and
for native language detection [47].
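
A minimal sketch of this feature follows. Skipping sentences shorter than five tokens, so that the five positions never overlap, is an assumption made here for simplicity; the names `POSITIONS` and `positional_token_counts` are hypothetical.

```python
from collections import Counter

# Position name -> index into the token list of a sentence.
POSITIONS = {"first": 0, "second": 1, "third": 2, "penultimate": -2, "last": -1}


def positional_token_counts(sentences, min_len=5):
    """Count (position, token) pairs over tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        if len(sentence) < min_len:  # assumption: skip overlapping positions
            continue
        for name, index in POSITIONS.items():
            counts[(name, sentence[index].lower())] += 1
    return counts


sents = [["However", ",", "we", "agree", "."], ["Yes", "."]]
pos_counts = positional_token_counts(sents)
print(pos_counts)
```

As with the other frequency-based features, the raw counts would then be normalized by the number of tokens in the chunk.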

Finally, we experimented with contextual function words. Contextual FW are a
variation of POS trigrams where a trigram can be anchored by specific function words:
these are consecutive triplets (w1, w2, w3) where at least two of the elements are
function words, and at most one is a POS tag.
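
One way to read this definition is sketched below: slide a window of three tokens over a tagged sentence, keep the window if at least two tokens are function words, and replace the remaining token (if any) by its POS tag. This is an interpretation of the definition above, not the reference implementation of [18]; the function name and the toy function-word set are assumptions.

```python
def contextual_fw_trigrams(tagged_sentence, function_words):
    """Extract contextual function-word trigrams from one sentence.

    `tagged_sentence` is a list of (word, pos_tag) pairs and
    `function_words` a set of lower-cased function words.
    """
    trigrams = []
    windows = zip(tagged_sentence, tagged_sentence[1:], tagged_sentence[2:])
    for window in windows:
        is_fw = [word.lower() in function_words for word, _ in window]
        if sum(is_fw) >= 2:  # at least two function words, at most one tag
            trigrams.append(tuple(
                word.lower() if fw else tag
                for (word, tag), fw in zip(window, is_fw)
            ))
    return trigrams


sentence = [("the", "DT"), ("house", "NN"), ("of", "IN"),
            ("the", "DT"), ("king", "NN")]
print(contextual_fw_trigrams(sentence, {"the", "of"}))
```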

POS-trigrams, positional tokens and contextual-FW-trigrams are calculated as
detailed in [18], but we only considered the 1000 most frequent feature values extracted
from each dataset.


¹³ The list of French and German FW was downloaded from
https://code.google.com/archive/p/stop-words/.



4.3 Results 


Supervised identification of translationese. We begin with supervised identification
of translated text using the features detailed in Section 4.2;¹⁴ Table 3 reports the results.
The total number of chunks used for classification is reported per dataset, where we used
the maximum available amount of data (up to 1000 chunks).


Table 3. Ten-fold supervised cross-validation classification (rounded) accuracy of English,
French and German translationese; the best result in each column is boldfaced.

                    EN(O)+FR→EN            EN(O)+DE→EN       FR(O)+EN→FR            DE(O)+EN→DE
feature / corpus    EUR  HAN  LIT  TED     EUR  LIT  POL     EUR  HAN  LIT  TED     EUR  LIT  POL
total # of chunks   1K   1K   400  40      1K   650  100     1K   1K   400  40      1K   600  90
FW                  96   99   99   90      96   95   100     96   94   93   93      99   96   99
pos. tokens         97   96   96   93      93   93   99      96   98   95   96      98   92   100
POS-trigrams        98   99   98   94      97   94   100     95   88   95   94      98   93   99
contextual FW       94   98   93   90      92   89   98      96   95   94   98      94   83   90


In line with previous works, the classification results are very high, yielding
near-perfect accuracy with all feature types across all datasets. Close inspection of highly
discriminative feature values sheds interesting light on the realization of unique
characteristics of translationese across languages and domains; we leave this discussion for
another venue.

Supervised classification methods, however, suffer from two main drawbacks:
(i) they inherently depend on data annotated with the translation direction, and (ii) they
may not generalize to unseen (related or unrelated) domains. Indeed, a series of works
on supervised identification of translationese reveals that classification accuracy
dramatically deteriorates when a classifier is evaluated out-of-domain (i.e., trained and tested on
texts drawn from different corpora): [15,38,39,19] demonstrated a significant drop in the
accuracy of classification when one of the parameters (genre, source language,
modality) was changed. These shortcomings undermine the usability of supervised methods
for translationese identification in a typical real-life scenario, where no labelled
in-domain data are available.


Unsupervised identification of translationese. To overcome the dependence of
supervised classification on domain and on labeled data, we experiment in this section with
unsupervised methods. We adopt the approach detailed in [19], who demonstrated high
accuracy in identifying English translationese by means of clustering.

Table 4 presents the results; the reported numbers reflect average accuracy over
30 experiments (the only difference being a random choice of the initial conditions).¹⁵
Europarl and Hansard systematically obtain very high accuracy with all feature types
(with the single exception of FW for French Hansard), implying a uniform distribution of
other linguistic aspects (authorship, topic, modality, epoch, etc.) in these sub-corpora;
this facilitates the unsupervised clustering procedure, since the translation
status of a text dominates other dimensions.

¹⁴ Feature combinations yield similar, occasionally slightly better, results; we refrain from
providing full analysis in this paper.

¹⁵ Standard deviation in most experiments was close to 0.
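
For readers who wish to reproduce averaged clustering scores of this kind, the sketch below shows one common way to score a two-way clustering against gold O/T labels and to average over repeated runs. It is an assumption, not a description of the procedure in [19]: cluster ids are arbitrary, so the better of the two possible cluster-to-class mappings is taken, and the function names are hypothetical.

```python
from statistics import mean


def clustering_accuracy(gold, clusters):
    """Accuracy of a 2-way clustering under the better label mapping.

    `gold` and `clusters` are equal-length sequences of 0/1 values.
    """
    agree = sum(1 for g, c in zip(gold, clusters) if g == c)
    accuracy = agree / len(gold)
    return max(accuracy, 1 - accuracy)


def average_accuracy(gold, runs):
    """Mean accuracy over several clustering runs (e.g. random restarts)."""
    return mean(clustering_accuracy(gold, run) for run in runs)


gold = [0, 0, 1, 1]
runs = [[1, 1, 0, 1], [0, 0, 1, 1]]  # invented cluster assignments
print(average_accuracy(gold, runs))
```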


Table 4. Clustering (rounded) accuracy of English, French and German translationese; the best
result in each column is boldfaced.

                    EN(O)+FR→EN            EN(O)+DE→EN       FR(O)+EN→FR            DE(O)+EN→DE
feature / corpus    EUR  HAN  LIT  TED     EUR  LIT  POL     EUR  HAN  LIT  TED     EUR  LIT  POL
total # of chunks   1K   1K   400  40      1K   650  100     1K   1K   400  40      1K   600  90
FW                  92   91   77   89      95   70   100     91   71   72   95      96   68   98
pos. tokens         87   95   55   67      80   64   99      97   86   80   83      94   68   99
POS-trigrams        97   94   71   61      94   67   100     95   79   60   85      95   68   99
contextual FW       87   96   78   70      88   67   98      96   91   72   98      95   64   89

A notably high accuracy is obtained on the small TED corpus, which implies the
applicability of the clustering methodology to data-meager scenarios. The exceptionally
high accuracy achieved by the unsupervised procedure on the politics dataset (both English
and German, across all feature types) may indicate the existence of additional artifacts (e.g.,
subtle topical differences) that tease apart O from T, thus boosting the classification
procedure. We leave this investigation for future work.

We explain the lower accuracy achieved on the literature corpus by its unique
character: translators of literary works enjoy more freedom, rendering the translated texts
more similar to original writing. Yet, clustering with FW systematically yields
reasonable accuracy for the literature datasets as well. We therefore conclude that FW
are one of the best-performing and most reliable features for the task of
unsupervised identification of translationese.


Sensitivity analysis. Next we tested the sensitivity of the (supervised and unsupervised)
classifiers by varying the number of chunks that are subject to classification. We used FW
(one of the best performing, content-independent features) in these experiments. We
excluded the TED and politics datasets from these experiments due to their small size;
the results for the literature corpus are limited by the amount of available data in this
dataset. Figures 1 and 2 report supervised and unsupervised classification accuracy as
a function of the number of chunks used for this task.

Supervised classification accuracy remains stable when the number of chunks used
for classification decreases. Evidently, as few as 200 (100 on each side) chunks are
sufficient for excellent classification in most cases. Clustering results demonstrate a similar
pattern: the vast majority of datasets preserve perfectly stable performance when the
number of chunks decreases. A single exception is Hansard French (O + T from
English), which exhibits results with considerable variance; we attribute these fluctuations to
the random choice of the samples subject to clustering.

Fig. 1. Supervised classification accuracy as a function of the number of chunks, using function words.

Fig. 2. Clustering accuracy as a function of the number of chunks, using function words.

Unsupervised classification is an inherently sensitive procedure; thus the stable
accuracy obtained by the majority of sub-corpora implies high reliability and applicability
of the clustering procedure to scenarios where only little data are available.

5 Conclusion 

We present diverse parallel bilingual English-French and English-German corpora with
accurate indication of the translation direction. To evaluate the quality of the corpus,
we carried out a series of experiments across all sub-corpora, using both supervised and
unsupervised methodologies and various feature types. This is the first work (to the best
of our knowledge) employing unsupervised classification across multiple languages and
diverse registers, and the encouraging results stress the applicability of this
methodology and encourage further research in this field.

It has been shown in a series of works [14,48-50,33,51] that awareness of
translationese has a positive effect on the quality of SMT. The parallel resources presented in
this work enable exploitation of bilingual information for the task of identification of
translationese. More precisely, the datasets that we compiled can be used for the task of
identifying the translation direction of parallel texts, a task that has enjoyed growing interest
in recent years [23,25].

The potential value of this work leaves much room for further exploratory and
practical activities. Our future plans include extending this set of corpora to additional
domains and languages, as well as exploitation of bilingual information for highly accurate
identification of translationese in small text units, eventually at the sentence level.